Insurance redlining -- a complete example

12/03/2019

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

Introduction

  • In the following lectures, we will go through a relatively complete example of the process of data analysis, regression, diagnostics, and remediation.

  • Particularly, we will discuss a difficult to analyze question of systematic bias in insurance practices.

  • Insurance, by nature of their business, need to price the cost of covering the risk of damages, based on the probability of these damages.

    • There are many legitimate reasons we can imagine why an insurance company would raise the prices on a driver who frequently violates road rules and is an overly agressive driver;
    • this individual is more likely to cause damages to themsevles or others, and therefore, the price to cover these damages will increase (or insurance will be refused altogether).
  • However, we are also in a society that has historically enforced segregation, unequal rights and unequal access to services.

    • While these were once considered legitimate legal practice, discriminatory business practices are now generally considered immoral and illegal.
    • Moreover, while these direct discriminatory practices have largely ended, they have led to historical and lasting inequality in the communities that have been affected by them.

Introduction – continued

  • We will investigate a complicated question:

    • were insurance companies were applying discriminatory business practices on majority non-white neighborhoods in Chicago,
    • or were their practices justifiable based on standard business practices, e.g.,
    • limiting access or raising prices based on justifiable crime statistics, etc?
  • The term “insurance redlining” refers litterally drawing a red line in a map, that excludes certain neighborhoods from services.

  • In the late 1970s, the US Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods.

  • Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978 was recorded.

  • The FAIR plan was offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market.

  • Information on other variables that might affect insurance writing such as fire and theft rates was also collected at the zip code level.

Chicago (chredlin) dataset

Map of Chicago neighborhoods Courtesy of Peter Fitzgerald CC BY-SA 3.0
  • The data that we have available for our analysis includes the variables:
    1. race – ethnic composition in percentage of minority;
    2. fire – fires per 100 housing units;
    3. theft – thefts per 1000 population;
    4. age – percentage of housing units built before 1939;
    5. involact – new FAIR plan policies and renewals per 100 housing units;
    6. income – median family income in thousands of dollars;
    7. side – north or south side of Chicago.
  • For reference, the South Side of Chicago has long had a reputation and cultural identity built around communities of working class African, Hispanic, Chinese, Irish, and Polish Americans, amongst others.
  • This is due, in part, to historic segregation measures which by the late 1970s were considered illegal.

The ecological fallacy

  • We note here that we do not actually know the ethnicity of the individuals who are denied insurance in this data set.

  • Rather, we only know the ethnic makeup of the zip code where an individual was denied.

  • This is an important difficulty that needs to be considered carefully before starting our analysis.

  • When data are collected at the group level, we may observe a correlation between two variables.

  • The ecological fallacy is concluding that the same correlation holds at the individual level.

  • For example the ecological fallacy was discussed in a court challenge to the Washington gubernatorial election of 2004;

    • In this election, a number of illegal voters were identified, after the election; their votes were unknown, because the vote was by secret ballot.
    • The challengers argued that illegal votes cast in the election would have followed the voting patterns of the precincts in which they had been cast, and thus adjustments should be made accordingly.
    • An expert witness said that this approach was like trying to figure out Ichiro Suzuki's batting average by looking at the batting average of the entire Seattle Mariners team,
    • since the illegal votes were cast by an unrepresentative sample of each precinct's voters, and might be as different from the average voter in the precinct as Ichiro was from the rest of his team.
    • The judge determined that the challengers' argument was an ecological fallacy and rejected it
    • See (Christopher Adolph (May 12, 2005). “Report on the 2004 Washington Gubernatorial Election”. Expert witness report to the Chelan County Superior Court in Borders et al v. King County et al.)

The ecological fallacy continued

  • Similarly, we can consider the ecological fallacy in US demographic data.
library("faraway")
head(eco)
            usborn income home      pop
Alabama    0.98656  21442 75.9  4040587
Alaska     0.93914  25675 34.0   550043
Arizona    0.90918  23060 34.2  3665228
Arkansas   0.98688  20346 67.1  2350725
California 0.74541  27503 46.4 29760021
Colorado   0.94688  28657 43.3  3294394
  • In the “eco” dataset, we have data per state of the USA on:
    1. the proportion of US born versus naturalized citizens in 1990;
    2. the per capita annual income in USD in 1998;
    3. percentage of residents born in state in 1990; and
    4. population in state in 1990.

The ecological fallacy continued

  • If we plot the proportion of US born versus the mean annual income, there appears to be a strong (anti)-correlation:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(income ~ usborn, data=eco, xlab="Proportion US born", ylab="Mean Annual Income", col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-2

  • Particularly, it appears that states with a higher number of naturalized citizens correlates with higher per-capita income, whereas states with a lower number of naturalized citizens correlates with lower per-capita income.

The ecological fallacy continued

  • Plotting this again with a trendline from simple regression, we find a reasonable fit to the data:
lmod <- lm(income ~ usborn, eco)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(income ~ usborn, data=eco, xlab="Proportion US born", ylab="Mean Annual Income",xlim=c(0,1),ylim=c(15000,70000),xaxs="i", col=1,  cex=3, cex.lab=3, cex.axis=1.5)
abline(coef(lmod))

plot of chunk unnamed-chunk-3

The ecological fallacy continued

  • And we find that the relationship holds with \( 5\% \) significance:
sumary(lmod)
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  68642.2     8739.0  7.8547 3.188e-10
usborn      -46018.6     9279.1 -4.9594 8.891e-06

n = 51, p = 2, Residual SE = 3489.54138, R-Squared = 0.33
  • Q: can we conclude, therefore, that naturalized citizens earn more than US born citizens in general?

  • A: no, this is precisely the ecological fallacy.

  • From the US Census Bureau, we know that generally US born citizens earn slightly more than their naturalized peers.

  • It is unreasonable to conclude that naturalized citizens have higher income in general just because there is a correlation at the state demographic level between more naturalized citizens and higher per-capita income.

    • Rather, it is more plausible that many immigrants come to states that have high economic activity already for the economic opportunities available there.

The ecological fallacy continued

  • In our case, with the Chicago insurance data, we have to note likewise the danger of the ecological fallacy.

  • Particularly, we only have aggregate data on the zip code level describing the ethnic composition of different neighborhoods, and the number of FAIR plan policies per neighborhood.

  • Even if there is a strong correlation between neighborhoods that have a high proportion of minority residents and a higher number of fair plans,

    • we cannot make the conclusion that minority individuals in these neighborhoods were disproportionately rejected on the normal insurance market.
    • It will be especially difficult to conclude that there was a systematic bias and discriminatory business practice given the aggregate data.
  • Dealing with this subtlety will be a challenge of this dataset in terms of drawing conclusions.

  • We start by performing some exploratory analysis:

head(chredlin, n=4)
      race fire theft  age involact income side
60626 10.0  6.2    29 60.4      0.0 11.744    n
60640 22.2  9.5    44 76.5      0.1  9.323    n
60613 19.6 10.5    36 73.5      1.2  9.948    n
60657 17.3  7.7    37 66.9      0.5 10.656    n

Exploratory analysis

summary(chredlin)
      race            fire           theft             age       
 Min.   : 1.00   Min.   : 2.00   Min.   :  3.00   Min.   : 2.00  
 1st Qu.: 3.75   1st Qu.: 5.65   1st Qu.: 22.00   1st Qu.:48.60  
 Median :24.50   Median :10.40   Median : 29.00   Median :65.00  
 Mean   :34.99   Mean   :12.28   Mean   : 32.36   Mean   :60.33  
 3rd Qu.:57.65   3rd Qu.:16.05   3rd Qu.: 38.00   3rd Qu.:77.30  
 Max.   :99.70   Max.   :39.70   Max.   :147.00   Max.   :90.10  
    involact          income       side  
 Min.   :0.0000   Min.   : 5.583   n:25  
 1st Qu.:0.0000   1st Qu.: 8.447   s:22  
 Median :0.4000   Median :10.694         
 Mean   :0.6149   Mean   :10.696         
 3rd Qu.:0.9000   3rd Qu.:11.989         
 Max.   :2.2000   Max.   :21.480         
  • We see that there is a wide range of values for “race” and that the values are skewed towards zero versus 100.

  • The third quartile is just over \( 57\% \), indicating that about a quarter of the zip codes in Chicago are majority minority neighborhoods.

    • The max value is almost \( 100\% \), while the minimum is about \( 1\% \).
  • This is a useful fact for our analysis, because if all neighborhoods were homogeneous (approximately equal percentages for all zip codes) we wouldn't be able to distinguish any differences from our aggregate data.

Exploratory analysis – continued

summary(chredlin)
      race            fire           theft             age       
 Min.   : 1.00   Min.   : 2.00   Min.   :  3.00   Min.   : 2.00  
 1st Qu.: 3.75   1st Qu.: 5.65   1st Qu.: 22.00   1st Qu.:48.60  
 Median :24.50   Median :10.40   Median : 29.00   Median :65.00  
 Mean   :34.99   Mean   :12.28   Mean   : 32.36   Mean   :60.33  
 3rd Qu.:57.65   3rd Qu.:16.05   3rd Qu.: 38.00   3rd Qu.:77.30  
 Max.   :99.70   Max.   :39.70   Max.   :147.00   Max.   :90.10  
    involact          income       side  
 Min.   :0.0000   Min.   : 5.583   n:25  
 1st Qu.:0.0000   1st Qu.: 8.447   s:22  
 Median :0.4000   Median :10.694         
 Mean   :0.6149   Mean   :10.696         
 3rd Qu.:0.9000   3rd Qu.:11.989         
 Max.   :2.2000   Max.   :21.480         
  • We see likewise that theft and income are skewed varaibles.

  • Also, the number of FAIR plans has many zeros, including the entire first quartile.

  • This is potentially an issue for our linear model assumptions that we have used so far…

  • We will perform some quick exploratory plots to examine relationships among variables.

Exploratory analysis – continued

require(ggplot2)
ggplot(chredlin,aes(race,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-8

  • Here we see a reasonably strong linear relationship between FAIR coverage and neighborhoods with higher minority composition (though we note the ecological fallacy).

Exploratory analysis – continued

ggplot(chredlin,aes(fire,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-9

  • Here there is also a somewhat strong linear relationship with the number of fires and the number of FAIR coverage plans, though with some strong outliers.

Exploratory analysis – continued

ggplot(chredlin,aes(theft,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-10

  • With theft, we appear to have a point with a high leverage that is influencing the linear fit relationship between the FAIR coverage and theft

    • this single observation has influenced the linear trend, and with great uncertainty (\( 95\% \) confidence interval). Likewise, there could be nonlinear scaling to take account of.

Exploratory analysis – continued

ggplot(chredlin,aes(age,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-11

  • With age, there appears to be a linear relationship with older housing and higher number of FAIR plans;

    • however, in newer neighborhoods, the uncertainty of this relationhsip (\( 95\% \) confidence interval) becomes much wider.

Exploratory analysis – continued

ggplot(chredlin,aes(income,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-12

  • With income, there appears to be some relationship between higher income and a lower number of FAIR plans, but outliers and nonlinearity appear to affect the relationship.

  • Particularly, the high income observation appears to have high leverage, and to produce unphysical (negative) values for the relationship.

Exploratory analysis – continued

ggplot(chredlin,aes(side,involact)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-13

  • In the plot of FAIR plans versus the North or South Side of Chicago, we jitter the data points so they don't plot over each other. It is hard to tell the exact effect of the North or South side from this plot.

Exploratory analysis – continued

  • If we focus on the relationship between the minority composition of the neighborhoods versus the number of FAIR plans, we find:
sumary(lm(involact ~ race,chredlin))
             Estimate Std. Error t value  Pr(>|t|)
(Intercept) 0.1292180  0.0966106  1.3375    0.1878
race        0.0138824  0.0020307  6.8361 1.784e-08

n = 47, p = 2, Residual SE = 0.44883, R-Squared = 0.51
  • There is a non-zero parameter, with almost \( 100\% \) confidence — we have almost no doubt that there is some statistical relationship between neighborhoods with high minority composition and higher number of FAIR plans.

  • However, we have to note at this moment:

    1. the ecological fallacy, as mentioned before;
    2. the above is not like, e.g., a partial residual plot and it doesn’t factor out other factors like crime and fire rates.
  • Indeed, there may be a number of confounding variables that may provide a better explanation.

  • It could be legitimate buisness practice on the part of insurance companies to deny normal insurance when a neighborhood has too high of a risk;

    • we need to analyze this to see if we can actually identify a discriminatory practice.

Exploratory analysis – continued

ggplot(chredlin,aes(race,fire)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-15

  • When plotting fire versus race, we see a somewhat nonlinear but slightly positve trend between the neighborhoods of high minority concentration and higher rate of fires.

Exploratory analysis – continued

ggplot(chredlin,aes(race,theft)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-16

  • When plotting theft versus race, we see a trend of slightly higher rate of theft with neighborhoods of higher minority concentration.

  • We note, also, this plot has an extreme outlier that may be important for the analysis – this might be skewing the trend positively in a way that isn't consistent with the overall behaviour of the city.

Exploratory analysis – continued

  • We move to the question of model selection, to determine the best possible predictors for the number of FAIR plans in a neighborhood.

  • However, this is fundamentally an explanatory model in which we want to quantify the effect of the ethnic composition of a neighborhood on the number of plans.

    • If this was a disproportinately influential predictor, compared to e.g., theft and fire (when all values are held equal) then we might build evidence of discriminatory business practices.
  • With this purpose, we will thus emphasize techniques which help analyze the explanatory power of the model.

  • This also indicates an important consideration, where we need to control for certain variables in the model.

    • If, e.g., the parameter for “race” was insignificant in the presence of “income” we would struggle to justify a conclusion of illegal discriminatory practice.
  • In this case, we will belabor some of the model selection (and the sensitivity of the model parameters) so that we can make conclusions and explanations with greater confidence.

  • One change that we will make is to use log of income, as it is often done, due to the skewness of the variable; this is also because money operates practically in society on a logarithmic scale.

    • Practically speaking, moving from 10,000 to 11,000 dollars income is a much bigger difference than moving from 100,000 to 101,000 income, and its practical effect is nonlinear.

Model selection

  • We will begin with an information criterion view on the variables:
library("leaps")
sum_life <- summary(regsubsets(involact~ race + fire + theft + age + log(income) + side, data=chredlin))
sum_life
Subset selection object
Call: regsubsets.formula(involact ~ race + fire + theft + age + log(income) + 
    side, data = chredlin)
6 Variables  (and intercept)
            Forced in Forced out
race            FALSE      FALSE
fire            FALSE      FALSE
theft           FALSE      FALSE
age             FALSE      FALSE
log(income)     FALSE      FALSE
sides           FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
         race fire theft age log(income) sides
1  ( 1 ) "*"  " "  " "   " " " "         " "  
2  ( 1 ) "*"  "*"  " "   " " " "         " "  
3  ( 1 ) "*"  "*"  "*"   " " " "         " "  
4  ( 1 ) "*"  "*"  "*"   "*" " "         " "  
5  ( 1 ) "*"  "*"  "*"   "*" "*"         " "  
6  ( 1 ) "*"  "*"  "*"   "*" "*"         "*"  
  • Here, “race” is the best single predictor model, and appears in all “best” choices of some number of explanatory variables.

Model selection – continued

  • Here, minimizing the BIC, we find that the model with race, fire, theft, and age is favored:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$bic, xlab = "Number explanatory variables", ylab = "Bayesian Information Criterion",  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-18

Model selection – continued

  • Adjusted \( R^2_a \) favors the same:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$adjr2, xlab = "Number explanatory variables", ylab = "Adjusted R squared",  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-19

Model selection – continued

  • With Mallow's Cp, both the four and five variable model are good, but simplicity would favor four explanatory variables:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$cp, xlab = "Number explanatory variables", ylab = "Mallows Cp",  col=1,  cex=3, cex.lab=3, cex.axis=1.5)
abline(0,1)

plot of chunk unnamed-chunk-20

Model selection – continued

  • Finally, to compare the two nested models, we can check the exclude one t-test
lmod <- lm(involact ~ race + fire + theft + age + log(income), data=chredlin)
sumary(lmod)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -1.1855396  1.1002549 -1.0775 0.2875500
race         0.0095022  0.0024896  3.8168 0.0004485
fire         0.0398560  0.0087661  4.5466 4.758e-05
theft       -0.0102945  0.0028179 -3.6533 0.0007276
age          0.0083356  0.0027440  3.0377 0.0041345
log(income)  0.3457615  0.4001234  0.8641 0.3925401

n = 47, p = 6, Residual SE = 0.33453, R-Squared = 0.75
  • and we find log-income to not be significant.

Model selection – continued

  • Plotting median income versus the percent of minority in each neighborhood, we see that these variables are strongly anti-correlated:
ggplot(chredlin,aes(race,income)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-22

cor(chredlin$income,chredlin$race)
[1] -0.7037328
  • This is due, in part, to the historic segregation and inequality in Chicago, and the influence of these two variables can be very difficult to separate.

Model selection – continued

  • The strong anti-correlation of log-income and race makes the variable selection take on a political dimension.

    • While log-income seems to largely be statistically unimportant in the presence of the other variables, if we don't control for it, it might over-state the importance of the race parameter.
  • The best case scenario would be that we would perform the model analysis both with and without the log-income and compare the results

    • however, for sake of example, we will only study the model with log-income included (as in Faraway).

Diagnostics

  • Before we make conclusions, we will perform diagnostics.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,1:1,  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-23

  • We should note one possible problem lying in the negative fitted values (number of FAIR plans), where there is structure in this tail.

Diagnostics – continued

  • Partiuclarly, we have that,
lmod$fitted[lmod$fitted < 0 ]
      60611       60656       60638       60652       60655 
-0.05706070 -0.23408790 -0.05930948 -0.28650119 -0.15827619 
  • there are five zip codes with predicted negative values for the number of FAIR plans.

Diagnostics – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,1:1,  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-25

  • These small negative numbers are problematic, due in part to the high number of zero values for the FAIR plans;

    • however, barring this problem which is structural due to the response variable, we have otherwise good residuals.
  • Unfortunately, a scale change won't fix this realistically — a square root will make nominal difference, but not enough to justify the loss of interpretability of the response.

    • In this case, we will proceed cautiously, bearing in mind and qualifying our conclusions with the known issues.

Diagnostics – continued

  • The Q-Q plot of standardized resiudals looks very good:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,2,  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-26

Diagnostics – continued

  • Again, outside the problematic fitted values, the variance in the square-root, standardized residuals is pretty good.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,3,  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-27

Diagnostics – continued

  • Finally, we notice once again the extreme leverage points that may need to be taken into account; on the other hand, the largest standardized residuals aren't large enough for concern:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,5,  col=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-28

Diagnostics – continued

  • To verify the outlier analysis numerically, we can perform the Bonferroni corrected \( 5\% \) signficance test on the Studentized Residuals.

  • Remember, the parameters are \( \frac{\alpha}{2 * n} \) for the correct critical value of the t distribution of \( n-p-1 \) degrees of freedom:

sumary(lmod)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -1.1855396  1.1002549 -1.0775 0.2875500
race         0.0095022  0.0024896  3.8168 0.0004485
fire         0.0398560  0.0087661  4.5466 4.758e-05
theft       -0.0102945  0.0028179 -3.6533 0.0007276
age          0.0083356  0.0027440  3.0377 0.0041345
log(income)  0.3457615  0.4001234  0.8641 0.3925401

n = 47, p = 6, Residual SE = 0.33453, R-Squared = 0.75
stud <- rstudent(lmod)
alpha_crit <- qt(0.05/(2 * 47), 40)
  • We compute the critical value as above…

Diagnostics – continued

  • But we find that there are no outliers to the fit of the regression function, when we compensate for multiple hypothesis testing:
abs(stud) > abs(alpha_crit)
60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
60646 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
60623 60608 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
60652 60620 60619 60649 60617 60655 60643 60628 60627 60633 60645 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 

Summary of diagnostics

  • We note that there is a structural issue arising due to the high number of zero FAIR plans (the response variable), but that there isn't a realistic choice of scale transformation to fix this.

  • Normality and variance assumptions of the error are otherwise OK.

  • We will probably need to evaluate the effect of points of high leverage in the analysis.

  • We wish thus to study the sensitivity of the main parameter of interest “race”, with respect to the leverage points and controlling for covariates in the analysis.

  • We will begin by looking at partial residual plots, which are used for nonlinearity detection when factoring out for covariates.

    • Specifically, we will evaluate the structural uncertainty of the model in terms of the hypothesis, \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \] is a valid form for the signal.
  • We will also look at the confidence intervals for the parameters.

Evaluating the uncertainty

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=1,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-31

coefficients(lmod)[2]
       race 
0.009502223 
confint(lmod, parm = "race")
           2.5 %     97.5 %
race 0.004474458 0.01452999

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=2,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-32

coefficients(lmod)[3]
      fire 
0.03985604 
confint(lmod, parm = "fire")
          2.5 %     97.5 %
fire 0.02215246 0.05755963

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=3,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-33

coefficients(lmod)[4]
      theft 
-0.01029451 
confint(lmod, parm = "theft")
            2.5 %       97.5 %
theft -0.01598536 -0.004603655

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=4,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-34

coefficients(lmod)[5]
      age 
0.0083356 
confint(lmod, parm = "age")
          2.5 %     97.5 %
age 0.002793968 0.01387723

Evaluating the uncertainty – continued

par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=5,  cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-35

coefficients(lmod)[6]
log(income) 
  0.3457615 
confint(lmod, parm = "log(income)")
                 2.5 %   97.5 %
log(income) -0.4623041 1.153827

Evaluating the uncertainty – continued

  • For each of:
    1. race
    2. fire
    3. age
  • there isn’t a great deal of uncertainty in the effect on the response, and the structure appears to be fine in the partial residuals.
  • We make note, however, that while the confidence intervals for theft are small, the analysis may be strongly affected by the observation of high leverage.

    • Indeed, the coefficient for theft is negative, and this value may be strongly affected by the extreme zip code.
  • This is something we will consider later in the analysis.

Evaluating the uncertainty – continued

  • With the generally good diagnostics, in terms of Gaussianity and constant variance of the errors (with some qualifications) we might make some conclusions.

    • That is, we can put some faith in these p-values and confidence intervals, but this should always be qualified.
  • Loosely speaking, it appears that there is a small, but statistically significant, effect in which neighborhoods with higher concentrations of ethnic minorities have a higher number of FAIR plans, when all other variables are held equal.

  • This effect is not nearly as strong as e.g., the rate of fires with all other variables held equal.

    • However, this indicates that there is a small, but non-random effect where neighborhoods have higher number of FAIR plans based on a variable that shouldn't matter in an insurance application decision.
  • Our analysis is not even nearly complete, and we cannot say if any application for insurance was rejected based upon their ethnic background;

    • nonetheless, this corroborates the claims of the Chicago neighborhood organizations who filed their civil rights lawsuit – this warrants further investigation of the overall effect.

Evaluating the sensitivity of parameters

  • In terms of analyzing the sensitivity of the race parameter with respect to the inclusion or exclusion of other covariates, we can automate this to determine if the effect on the response changes dramatically.

    • This is a larger kind of uncertainty in the model selection itself;
    • if e.g., the parameter changed sign with a different choice of variables, we would have very contradictory explanations of the effect of the concentration of minorities on the number of FAIR plans.
  • This is similar to the case when we used matching in the voting districts in New Hampshire, where we needed to see if an additional covariate was acting as a confounding variable on our analysis.

  • By analyzing the inter-model stability of this parameter, we have a better sense of if this explanation is actually robust.

Evaluating the sensitivity of parameters – continued

  • Suppressing the code to automate this (in Faraway 2nd Edition) we will compare the parameter value and associated p-value for race in each combination of the variables:
                                            beta   pvalue
race                                        0.0139 0.0000
race + fire                                 0.0089 0.0002
race + theft                                0.0141 0.0000
race + age                                  0.0123 0.0000
race + log ( income )                       0.0082 0.0087
race + fire + theft                         0.0082 0.0002
race + fire + age                           0.0089 0.0001
race + fire + log ( income )                0.0070 0.0160
race + theft + age                          0.0128 0.0000
race + theft + log ( income )               0.0084 0.0083
race + age + log ( income )                 0.0099 0.0017
race + fire + theft + age                   0.0081 0.0001
race + fire + theft + log ( income )        0.0073 0.0078
race + fire + age + log ( income )          0.0085 0.0041
race + theft + age + log ( income )         0.0106 0.0010
race + fire + theft + age + log ( income )  0.0095 0.0004
  • The above output shows the parameter for race and the associated p-values for all 16 models.

  • There is some variance in the magnitude of the effect, depending on the other variables we control for, but in no case does the p-value rise above \( 5\% \) or does the parameter change sign.

  • This suggests some uncertainty over the magnitude of the effect, we can be sure that the significance of the effect is not sensitive to the choice of adjusters.

Evaluating the sensitivity of parameters – continued

  • If we suppose that we were able to find models in which the composition of the neighborhood in terms of minorities was not statistically significant, our ability to control for other factors would be more difficult.

  • We would then need to consider more deeply which covariates should be adjusted for and which not.

  • This level of model analysis and selection is at the heart of many real world problems, and we are fortunate as modellers in this case that it is unnecessary.

  • In general, however, this is the reality of real problems you will encounter after this class.

  • Now, we want to analyze the effect of leverage and influence on the model and our interpretation.

  • We recall that there are several influential points, and we wish to determine if their inclusion or exclusion would radically change the interpretation for the parameter for race.

Evaluating the sensitivity of parameters – continued

  • We plot below, using the “dfbeta” function we used in another analysis, the change of the parameter for race when excluding any particular observation:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(dfbeta(lmod)[,2],ylab="Change in race coef", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)

plot of chunk unnamed-chunk-36

  • The difference is negligible given the size of the parameter and scale of the response, so we are not especially concerned about the sensitivity of the parameter to observations.

Evaluating the sensitivity of parameters – continued

  • We note that we can also produce a similar plot from the “lm.influence” function, turning the output into a dataframe
diags <- data.frame(lm.influence(lmod)$coef)
  • Using this, we will plot the terms in ggplot2:
ggplot(diags,aes(row.names(diags), race)) + geom_linerange(aes(ymax=0, ymin=race)) + theme(axis.text.x=element_text(angle=90), axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold")) + xlab("ZIP") + geom_hline(yintercept = 0)

plot of chunk unnamed-chunk-38

Evaluating the sensitivity of parameters – continued

  • Likewise, using the diags dataframe, we can produce a two dimensional plot of the differences in each the fire and theft parameters when excluding one of the observations:
ggplot(diags,aes(x=fire,y=theft))+geom_text(label=row.names(diags)) +
  theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-39

Evaluating the sensitivity of parameters – continued

  • As in the earlier diagnostics, we see that the zip codes 60607 and 60610 stick out for high leverage and influence on the parameter values:
chredlin[c("60607","60610"),]
      race fire theft  age involact income side
60607 50.2 39.7   147 83.0      0.9  7.459    n
60610 54.0 34.1    68 52.6      0.3  8.231    n
  • These are both unusually high fire and high theft zip codes – we will remove these two observations simultaneously and see the effect on the paramters:
lmode <- lm(involact ~ race + fire + theft + age + log(income), chredlin,subset=-c(6,24))
sumary(lmode)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -0.5767365  1.0800462 -0.5340   0.59638
race         0.0070527  0.0026960  2.6160   0.01259
fire         0.0496474  0.0085701  5.7931 1.004e-06
theft       -0.0064336  0.0043489 -1.4794   0.14708
age          0.0051709  0.0028947  1.7863   0.08182
log(income)  0.1157028  0.4011132  0.2885   0.77453

n = 45, p = 6, Residual SE = 0.30320, R-Squared = 0.8
  • Their exclusion actually makes theft and age no longer significant, though race remains significant.

Evaluating the sensitivity of parameters – continued

  • We have verified that our conclusions are also robust to the exclusion of one or perhaps two cases from the data.

    • If our conclusions depended on including or excluding only one or two observations we would need to be particularly sure of these measurements.
    • In some situations, we might wish to drop such influential cases but this would require strong arguments that such points were in some way exceptional.
  • In any case, it would be very important to disclose this choice in the analysis, and any suppression of the information would be extremely dishonest.

Evaluating predictive power

  • Although this is a model designed for explanation purposes, for completeness in the example, we will look a the predictive power of the model:
x <- model.matrix(lmod)
x0 <- apply(x,2,median)
x0 <- data.frame(t(x0))
# note here that R won't accept the variable that we've rescaled until we re-name it back to its original name... 
colnames(x0)[6] <- "income"
predict(lmod,new=x0,interval="prediction")
          fit       lwr     upr
1 0.003348934 -1.398822 1.40552
predict(lmod,new=x0,interval="confidence")
          fit       lwr      upr
1 0.003348934 -1.225333 1.232031
  • We verify, due to the width of the confidence intervals for the mean and new observations on the median, that the model is worthless for predictions…

Qualifying our analysis: a hypothetical example

  • We have shown very robust evidence for the statistically significant effect of the neighborhood's ethnic composition, with all other variables held equal, having a significant effect in determining the number of FAIR plans in the neighborhood.

  • This doesn't mean conclusively, however, that there was systematic discrimination.

  • If we tried another hypothetical model, supressing theft and the two high influence observations:

modalt <- lm(involact ~ race+fire+log(income),chredlin, subset=-c(6,24))
sumary(modalt)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.7532557  0.8358798  0.9012   0.37277
race         0.0042061  0.0022757  1.8483   0.07178
fire         0.0510220  0.0084504  6.0378 3.822e-07
log(income) -0.3623825  0.3191620 -1.1354   0.26279

n = 45, p = 4, Residual SE = 0.30919, R-Squared = 0.79
  • the parameter for race is no longer significant.

  • This is just a hypothetical example, because our relatively exhaustive analysis has shown that this is a cherry-picked model, versus the other possible models.

Qualifying our analysis – continued

  • Generally, however, this shows some of the subtlety in statistical analysis like this

    • particularly, independent analyses can lead to a different sequence of models, remediation and refinement, possibly to different conclusions.
  • Part of robust analysis is thus to perform several model selections and the associated diagnostics to deal with the overall uncertainty in the model selection;

    • reasonably we should go back and perform the same steps without log income, for example.
  • The strength of our conclusions will increase if we can show that several different models all have similar interpretations.

  • If there are several reasonable models with different interpretations or conclusions, there is a fundamental issue in the data, and we should be cautious and be skeptical if there is really a signal to be found in the data.

    • A modeler who lacks scientific integrity can similarly explore a large number of models but report only the one that favors a particular conclusion;
    • this is extremely dishonest, and if discovered, often leads to the author being professionally discredited (their work not taken seriously and themselves unemployable).
    • This is likewiswe the case for industrial applications;
    • even if the work is not published, if the analyst consistently makes flimsy models that aren't demonstrably robust, they will be uemployable because they cannot help business operations in any meaninful way.

Qualifying our analysis – continued

  • Generally, we have performed rigorous analysis in the above (pending completion of additional analysis).

  • However, there are still several ways we should be cautious about our conclusions:

    1. Firstly is that this is based on aggregate data – we have assumed that the probability a minority homeowner would obtain a FAIR plan after adjusting for the effect of the other covariates is constant across zip codes.

      If this probability varies in a systematic way (as opposed to random noise about the signal) then our conclusions may be well off the mark.

    2. We have demonstrated statistical significance, but the practical effect is actually fairly small. The largest value of the response (percent FAIR plans) is only \( 2.2\% \) and most other values are much smaller.

      The predicted difference between \( 0\% \) minority and \( 100\% \) minority is about \( 1\% \) , so while we may be confident that some people were affected by discrimination, there may not be so many of them to call this systematic discrimination.

    3. There may be some other latent, confounding variable that we haven’t accounted for that could explain this small difference if included in the model.

    4. We may find other ways to aggregate the data, for which our conclusions will change.

Qualifying our analysis – continued

  • For example, if we fit two separate models for the North Side and South Side of Chicago, we can get very different results:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "s"),chredlin)
sumary(lmod)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9946613  1.6949252 -1.1768 0.256470
race         0.0093509  0.0046055  2.0304 0.059285
fire         0.0510812  0.0170314  2.9992 0.008493
theft       -0.0102565  0.0090972 -1.1274 0.276184
age          0.0067778  0.0053061  1.2774 0.219700
log(income)  0.6810543  0.6493336  1.0489 0.309832

n = 22, p = 6, Residual SE = 0.34981, R-Squared = 0.76
  • In the South Side, the parameter for race is no longer significant…

Qualifying our analysis – continued

  • while in the North Side, it is:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "n"),chredlin)
sumary(lmod)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7902939  1.7793768 -0.4441  0.66196
race         0.0133200  0.0053903  2.4711  0.02310
fire         0.0246654  0.0154219  1.5994  0.12623
theft       -0.0079383  0.0039815 -1.9938  0.06073
age          0.0090040  0.0046471  1.9375  0.06769
log(income)  0.1666853  0.6233641  0.2674  0.79204

n = 25, p = 6, Residual SE = 0.35115, R-Squared = 0.76
  • Generally, sub-dividing data into smaller and smaller groups, we can dillute the significance.

  • However, as noted before, aggregating data too much causes its own issues – the balance between these is contextual and subjective based on our understanding of the task at hand.

Qualifying our analysis – continued

  • In this context, we might explain the difference in terms of the demographics of the North Side and the South Side:
ggplot(chredlin,aes(side,race)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))

plot of chunk unnamed-chunk-47

summary(chredlin$race[chredlin$side=="n"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.80   10.00   21.95   24.50   94.40 
summary(chredlin$race[chredlin$side=="s"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   34.27   48.10   49.80   72.92   99.70 
  • With the South Side having a much higher concentration of minorities, the explanatory power of this variable has less effect in the south side, in the presence of other latent variables.

Qualifying our analysis – continued

Population map of race and ethnicity in Chicago 2010 Courtesy of Eric Fischer CC BY-SA 2.0
  • A visualization of the demographics of Chicago is to the left, where the North Side/ South Side divide is clearly expressed.
  • This is a visualization of the 2010 census, where each dot represents 25 residents.
  • The color coding is as follows:
    1. Red: non-Hispanic Caucasian
    2. Blue: African American
    3. Orange: Hispanic
    4. Green: Asian
    5. Yellow: other

Summary of the complete example

  • We performed initial, exploratory analysis, checking for structure in the data, and in the plots of variables versus the response.

  • We fit a model based on a mix of criterion methods and some qualitative judgement.

  • We performed diagnostics to see how our assumptions might be breaking down – aside from an unavoidable issue with the many zero observations for the FAIR plan, it was a well functioning model.

  • We evaluated the partial residuals, and confidence intervals for each of the parameters in our model.

    • This gave us some indication of the uncertainty of this parameter and the structure of the model with respect to it.
  • As this is an explanatory model, we were interested in “predicting” the parameter for “race”;

    • we wanted to understand the sensitivity of this parameter, so we analyzed the p-value and the parameter value for qualitative differences versus all possible covariates in the data.
    • This illustrated a deeper level of the model uncertainty, though one which we seemed to be relatively robust to.
  • We also demonstrated robustness to the exclusion of observations, and checked formally for outliers that could affect the fit.

  • Just for fun, we saw that it has awful predictive power…

  • We made a final assessment of the uncertainty that we could not account for with our data.

Summary of the complete example – continued

  • After all this analysis, everything we say needs to be hedged with “ifs” and “buts” and qualified carefully.

  • This is the reality of statistical modeling, where causual conclusions are extremely difficult to draw.

  • To quote Farway, quoting Winston Churchill:

    “Indeed, it has been said that Democracy is the worst form of government, except all those other forms that have been tried from time to time.”

  • We might say the same about statistics with respect to how it helps us reason in the face of uncertainty.

    • It is not entirely satisfying but the alternatives are worse.

Missing data

  • Missing data causes both technical data processing and statistical challenges in regression analysis.

  • By creating a framework for different kinds of missing data, we have some mathematical tools for understanding how it will affect our analysis.

  • Certain kinds of missing data are harder than others to treat mathematically – at times, we won't have good tools for filling missing data values, or treating the implicit bias therein.

  • However, certain kinds of missing data can be modeled statistically, along with the uncertainty when treating these cases.

Types of missing data

  • Here is a summary of different missing data regimes we might find ourselves:

    1. Missing cases – this is typically the situation we are in during any statistical study.
      • Particularly, we must infer how the signal will generalize to the population at large, using observations of only a small sub-sample.
      • In special circumstances, if the cases that we don’t observe are not observed due to the signal we are studying, our sample will be biased.
    2. Incomplete values – often times when examining medical outcomes, we will not know the final outcomes of patients before the medical study ends.
      • In this situation, data on participants may still hold useful information but we have to deal with the fundamental incompleteness. Methods to handle this include survival analysis
    3. Missing values – it is quite often that samples will have some of the observations of the response or explanatory variables missing or corrupted.
      • Depending on the type of “missingness” we will have different tools to handle this.

Types of missing values

  • The following types of missing values are distinguished as follows:

    1. Missing Completely At Random (MCAR)
      • The probability of any value being missing is the same for every sample.
      • In this case, there is no bias induced by how the values are missing, though we lose information on the signal.
    2. Missing At Random (MAR)
      • Here we suppose the probability of a value being missing depends on a systematic mechanism with the known explanatory variables.
      • E.g., in social surveys certain groups may be less likely to respond.
        • If we know what sub-group the sample belongs to, we can typically delete the incomplete observation provided we adjust the model for the group membership as a factor.
    3. Missing Not At Random (MNAR)
      • The probability that a value is missing from a sample depends on an unknown, latent variable that we don’t observe, or based on the response we wish to observe itself.
        • E.g., if an individual has something to hide that is embarrassing or illegal, they may quite often avoid disclosing information that would suggest so.
        • This is difficult, and often mathematically intractible to handle.
  • Amongst the above, when the data is MAR, we can adjust based on observed variables and therefore handle missingness and bias mathematically – we will focus on this situation.

A concrete example of missing values

  • If we wish to understand methods of treating missing data, we can take a data set and delete values to compare how conclusions might be affected by our treatment.

  • Prototypically, we will study the Chicago Insurance data once again, but with values missing at random.

library("faraway")
summary(chmiss)
      race            fire           theft             age       
 Min.   : 1.00   Min.   : 2.00   Min.   :  3.00   Min.   : 2.00  
 1st Qu.: 3.75   1st Qu.: 5.60   1st Qu.: 22.00   1st Qu.:48.30  
 Median :24.50   Median : 9.50   Median : 29.00   Median :64.40  
 Mean   :35.61   Mean   :11.42   Mean   : 32.65   Mean   :59.97  
 3rd Qu.:57.65   3rd Qu.:15.10   3rd Qu.: 38.00   3rd Qu.:78.25  
 Max.   :99.70   Max.   :36.20   Max.   :147.00   Max.   :90.10  
 NA's   :4       NA's   :2       NA's   :4        NA's   :5      
    involact          income      
 Min.   :0.0000   Min.   : 5.583  
 1st Qu.:0.0000   1st Qu.: 8.564  
 Median :0.5000   Median :10.694  
 Mean   :0.6477   Mean   :10.736  
 3rd Qu.:0.9250   3rd Qu.:12.102  
 Max.   :2.2000   Max.   :21.480  
 NA's   :3        NA's   :2       

Summarizing missing values

  • Before we saw how many NA values were in the dataset based on the explanatory varible, but we will also want to know which samples have missing values (and how many).
rowSums(is.na(chmiss))
60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631 
    1     0     1     1     0     0     0     0     1     1     0     0 
60646 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607 
    1     0     0     1     0     0     0     1     1     0     0     1 
60623 60608 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637 
    0     1     1     0     1     0     0     0     1     0     1     0 
60652 60620 60619 60649 60617 60655 60643 60628 60627 60633 60645 
    0     1     0     1     1     0     0     1     0     0     1 
  • Here there is at most one value missing from each row; likewise, the missing data is basically evenly spaced in the data.

    • If there was a large number of missing values in a few cases, we could likely drop these samples without loss of information.
    • However, in this example, dropping the samples with NA's would delete 20 out of 47 of the cases.

Summarizing missing values

  • We can also view how clumped or dispersed missing values are graphically as a diagnostic:
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
image(is.na(chmiss),axes=FALSE,col=gray(1:0))
axis(2, at=0:5/5, labels=colnames(chmiss), cex=3, cex.lab=3, cex.axis=1.5)
axis(1, at=0:46/46, labels=row.names(chmiss),las=2, cex=3, cex.lab=3, cex.axis=1.5)

plot of chunk unnamed-chunk-51

Deleting missing values

  • Suppose we favor a deletion approach to all samples with missing values.

  • We will compare the full model with all values versus the situation in which we delete missing values:

modfull <- lm(involact ~ .  - side, chredlin)
sumary(modfull)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -0.6089790  0.4952601 -1.2296 0.2258512
race         0.0091325  0.0023158  3.9435 0.0003067
fire         0.0388166  0.0084355  4.6015     4e-05
theft       -0.0102976  0.0028529 -3.6096 0.0008269
age          0.0082707  0.0027815  2.9734 0.0049143
income       0.0245001  0.0316965  0.7730 0.4439816

n = 47, p = 6, Residual SE = 0.33513, R-Squared = 0.75

Deleting missing values – continued

  • On the other hand, when we fit with the samples with missing values deleted (the default for “lm”):
modmiss <- lm(involact ~ ., chmiss)
sumary(modmiss)
              Estimate Std. Error t value  Pr(>|t|)
(Intercept) -1.1164827  0.6057615 -1.8431 0.0794750
race         0.0104867  0.0031283  3.3522 0.0030180
fire         0.0438757  0.0103190  4.2519 0.0003557
theft       -0.0172198  0.0059005 -2.9184 0.0082154
age          0.0093766  0.0034940  2.6837 0.0139041
income       0.0687006  0.0421558  1.6297 0.1180775

n = 27, p = 6, Residual SE = 0.33822, R-Squared = 0.79
  • The standard error increases because the estimates are less precise, due to the loss of information.

Single imputation

  • We may consider thus to “fill-in” data into the missing values for various samples – this is known as data imputation.

  • One approach is to fill in values by something “representative” of the known population,

    • e.g., fill in by each variable mean.
(cmeans <- colMeans(chmiss,na.rm=TRUE))
      race       fire      theft        age   involact     income 
35.6093023 11.4244444 32.6511628 59.9690476  0.6477273 10.7358667 
  • However, we don't want to fill in values for the response, as this is what we are trying to model:
mchm <- chmiss
for(i in c(1:4,6)) mchm[is.na(chmiss[,i]),i] <- cmeans[i]
imod <- lm(involact ~ ., mchm)

Single imputation – continued

  • Then, we look at the model summary, based on the imputation of the means:
sumary(imod)
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0708021  0.5094531  0.1390 0.890203
race         0.0071173  0.0027057  2.6305 0.012245
fire         0.0287418  0.0093855  3.0624 0.004021
theft       -0.0030590  0.0027457 -1.1141 0.272242
age          0.0060795  0.0032079  1.8952 0.065695
income      -0.0270917  0.0316782 -0.8552 0.397791

n = 44, p = 6, Residual SE = 0.38412, R-Squared = 0.68
  • In this case, there isn't just loss of fit, but also qualitative differences in the parameters.

  • Particularly, theft and age have lost significance in this model versus the iterations previously seen.

  • Also, the parameters themselves are smaller in magnitude, describing less “effect” in the signal overall.

Single imputation – continued

  • In this case, we see significant bias induced by the imputation (toward the mean values);

    • this shows how we usually will only consider mean imputation when the number of missing values is small relative to the full population.
  • In the case that there is a small number of missing values for a categorical variable, we can typically model the “missing-value” as a category of its own.

  • We can consider a more sophisticated approach for handling missing values using regression.

  • Particularly, if the variables are strongly (anti)-correlated with each other, there is information in the explanatory variables telling how they vary together (or oppostitely).

Single imputation – continued

  • Recall, for percent variables, it is sometimes useful to make a change of scale to the real line when this is a response.

  • The logit transformation is given as, \( y \mapsto \log\left(\frac{y}{1 - y}\right) \)

  • The “logit” and “ilogit” (inverse) map are available in the Faraway package.

  • We will model the percent ethnic minority in a zip code as regressed on the other variables with the missing cases removed:

lmodr <- lm(logit(race/100) ~ fire+theft+age+income,chmiss)
ilogit(predict(lmodr,chmiss[is.na(chmiss$race),]))*100
     60646      60651      60616      60617 
 0.4190909 14.7320193 84.2653995 21.3121261 
chredlin$race[is.na(chmiss$race)]
[1]  1.0 13.4 62.3 36.4
  • Two of the predictions are reasonable, but two are significantly off.

  • This can be performed for each of the explanatory variables, and while preferable to mean imputation, still leads to bias.

    • In a way, we can start over-fitting to our known values and loose the true variance in the population.

Multiple imputation

  • If we understand that the loss of the variation of the population is the issue with the earlier approaches, we can try to enforce some variance in the imputation mathematically.

    • If we re-introduce variation on the missing value, but only once, this would just be a less-optimal choice for the single imputation as we performed previously.
    • Instead, we will treat this as a re-sampling problem, and re-input multiple cases of perturbed values for the missing term that takes into account the uncertainty of this value.
  • We will create 25 different versions of the data “re-sampled” back with uncertainty for each missing value.

  • The known values will be the same, but we will have “m” different versions of the missing values, drawn from a “Bayesian posterior” estimate…

    • The function output of Amelia will include the “m” different datasets:

Multiple imputation – continued

library(Amelia)
set.seed(123)
chimp <- amelia(chmiss, m=25)
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9

-- Imputation 6 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 7 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43

-- Imputation 8 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83

-- Imputation 9 --

  1  2  3  4  5  6  7  8  9

-- Imputation 10 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 11 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51

-- Imputation 12 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

-- Imputation 13 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143

-- Imputation 14 --

  1  2  3  4  5  6  7

-- Imputation 15 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25

-- Imputation 16 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22

-- Imputation 17 --

  1  2  3  4  5  6  7  8  9

-- Imputation 18 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54

-- Imputation 19 --

  1  2  3  4  5  6  7  8  9 10

-- Imputation 20 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130

-- Imputation 21 --

  1  2  3  4  5  6  7  8

-- Imputation 22 --

  1  2  3  4  5  6  7  8

-- Imputation 23 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 24 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29

-- Imputation 25 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30

Multiple imputation – continued

  • All imputations are stored in the object that is output by the Amelia function:
str(chimp, 1)
List of 12
 $ imputations:List of 25
  ..- attr(*, "class")= chr [1:2] "mi" "list"
 $ m          : num 25
 $ missMatrix : logi [1:47, 1:6] FALSE FALSE FALSE FALSE FALSE FALSE ...
  ..- attr(*, "dimnames")=List of 2
 $ overvalues : NULL
 $ theta      : num [1:7, 1:7, 1:25] -1 0.01738 0.00786 -0.06694 -0.00626 ...
 $ mu         : num [1:6, 1:25] 0.01738 0.00786 -0.06694 -0.00626 -0.10637 ...
 $ covMatrices: num [1:6, 1:6, 1:25] 0.977 -0.517 0.609 0.532 0.506 ...
 $ code       : num 1
 $ message    : chr "Normal EM convergence."
 $ iterHist   :List of 25
 $ arguments  :List of 23
  ..- attr(*, "class")= chr [1:2] "ameliaArgs" "list"
 $ orig.vars  : chr [1:6] "race" "fire" "theft" "age" ...
 - attr(*, "class")= chr "amelia"

Multiple imputation – continued

  • To extract the imputations, we can select any particular list value from the object, extracting the field “imputations”:
str(chimp$imputations, 1)
List of 25
 $ imp1 :'data.frame':  47 obs. of  6 variables:
 $ imp2 :'data.frame':  47 obs. of  6 variables:
 $ imp3 :'data.frame':  47 obs. of  6 variables:
 $ imp4 :'data.frame':  47 obs. of  6 variables:
 $ imp5 :'data.frame':  47 obs. of  6 variables:
 $ imp6 :'data.frame':  47 obs. of  6 variables:
 $ imp7 :'data.frame':  47 obs. of  6 variables:
 $ imp8 :'data.frame':  47 obs. of  6 variables:
 $ imp9 :'data.frame':  47 obs. of  6 variables:
 $ imp10:'data.frame':  47 obs. of  6 variables:
 $ imp11:'data.frame':  47 obs. of  6 variables:
 $ imp12:'data.frame':  47 obs. of  6 variables:
 $ imp13:'data.frame':  47 obs. of  6 variables:
 $ imp14:'data.frame':  47 obs. of  6 variables:
 $ imp15:'data.frame':  47 obs. of  6 variables:
 $ imp16:'data.frame':  47 obs. of  6 variables:
 $ imp17:'data.frame':  47 obs. of  6 variables:
 $ imp18:'data.frame':  47 obs. of  6 variables:
 $ imp19:'data.frame':  47 obs. of  6 variables:
 $ imp20:'data.frame':  47 obs. of  6 variables:
 $ imp21:'data.frame':  47 obs. of  6 variables:
 $ imp22:'data.frame':  47 obs. of  6 variables:
 $ imp23:'data.frame':  47 obs. of  6 variables:
 $ imp24:'data.frame':  47 obs. of  6 variables:
 $ imp25:'data.frame':  47 obs. of  6 variables:
 - attr(*, "class")= chr [1:2] "mi" "list"

Multiple imputation – continued

  • We will fit a model over each of the versions, and try to find a best model over all perturbed datasets by an averaging.

    • Specifically, we will take the average of the parameters as: \[ \begin{align} \hat{\boldsymbol{\beta}}_j \triangleq \frac{1}{m} \sum_{j=1}^m \hat{\boldsymbol{\beta}}_{ij}, \end{align} \] where here each \( i \) represents one of the imputations.
  • In this case, we can estimate the overall standard error in terms of the variance of the parameters \( \beta \) derived over the different versions of the imputed data.

    • Specifically, the combined standard errors are given as, \[ \begin{align} s_j^2 \triangleq \frac{1}{m} \sum_{i=1}^m s_{ij}^2 + var\left(\hat{\boldsymbol{\beta}}_j\left( 1 + \frac{1}{m}\right)\right) \end{align} \] where the variance taken above is the sample variance over the parameters, derived by the imputation.

Multiple imputation – continued

  • The “mi.field” function will make these computations, though we automate the re-fitting of the models with the for loop below:
betas <- NULL
ses <- NULL
for(i in 1:chimp$m){
  lmod <- lm ( involact ~ race + fire + theft +age , chimp $ imputations [[ i ]])
  betas <- rbind ( betas , coef ( lmod ))
  ses <- rbind (ses , coef ( summary ( lmod )) [ ,2])
}
(cr <- mi.meld(q=betas,se=ses))
$q.mi
     (Intercept)        race       fire       theft         age
[1,]  -0.2378402 0.007730138 0.03546648 -0.00873242 0.007218719

$se.mi
     (Intercept)        race        fire       theft         age
[1,]    0.161498 0.002026873 0.008519401 0.004555294 0.002531306

Multiple imputation – continued

  • The t-stastistics can be computed similarly,
cr$q.mi/cr$se.mi
     (Intercept)     race     fire     theft      age
[1,]   -1.472713 3.813825 4.163025 -1.916983 2.851776
  • Here the results are fairly similar, though in this case theft is no longer a significant parameter.

  • Many more techniques exist studying missing data, and this is just an introduction on how to approach this with additional references in Faraway.